{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Writing Custom Dataset Exporters\n",
    "\n",
    "This recipe demonstrates how to write a [custom DatasetExporter](https://voxel51.com/docs/fiftyone/user_guide/export_datasets.html#custom-formats) and use it to export a FiftyOne dataset to disk in your custom format."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "If you haven't already, install FiftyOne:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install fiftyone"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this recipe we'll use the [FiftyOne Dataset Zoo](https://docs.voxel51.com/dataset_zoo/index.html) to download the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) to use as sample data to feed our custom exporter.\n",
    "\n",
    "Behind the scenes, FiftyOne uses either the\n",
    "[TensorFlow Datasets](https://www.tensorflow.org/datasets) or\n",
    "[TorchVision Datasets](https://pytorch.org/vision/stable/datasets.html) libraries to wrangle the datasets, depending on which ML library you have installed.\n",
    "\n",
    "You can, for example, install PyTorch as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install torch torchvision"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing a DatasetExporter\n",
    "\n",
    "FiftyOne provides a [DatasetExporter](https://voxel51.com/docs/fiftyone/api/fiftyone.utils.data.html#fiftyone.utils.data.exporters.DatasetExporter) interface that defines how it exports datasets to disk when methods such as [Dataset.export()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.html#fiftyone.core.dataset.Dataset.export) are used.\n",
    "\n",
    "`DatasetExporter` itself is an abstract interface; the concrete interface that you should implement is determined by the type of dataset that you are exporting. See [writing a custom DatasetExporter](https://voxel51.com/docs/fiftyone/user_guide/export_datasets.html#custom-formats) for full details.\n",
    "\n",
    "In this recipe, we'll write a custom [LabeledImageDatasetExporter](https://voxel51.com/docs/fiftyone/api/fiftyone.utils.data.html#fiftyone.utils.data.exporters.LabeledImageDatasetExporter) that can export an image classification dataset to disk in the following format:\n",
    "\n",
    "```\n",
    "<dataset_dir>/\n",
    "    data/\n",
    "        <filename1>.<ext>\n",
    "        <filename2>.<ext>\n",
    "        ...\n",
    "    labels.csv\n",
    "```\n",
    "\n",
    "where `labels.csv` is a CSV file that contains the image metadata and associated labels in the following format:\n",
    "\n",
    "```\n",
    "filepath,size_bytes,mime_type,width,height,num_channels,label\n",
    "<filepath>,<size_bytes>,<mime_type>,<width>,<height>,<num_channels>,<label>\n",
    "<filepath>,<size_bytes>,<mime_type>,<width>,<height>,<num_channels>,<label>\n",
    "...\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's the complete definition of the `DatasetExporter`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import os\n",
    "\n",
    "import fiftyone as fo\n",
    "import fiftyone.utils.data as foud\n",
    "\n",
    "\n",
    "class CSVImageClassificationDatasetExporter(foud.LabeledImageDatasetExporter):\n",
    "    \"\"\"Exporter for image classification datasets whose labels and image\n",
    "    metadata are stored on disk in a CSV file.\n",
    "\n",
    "    Datasets of this type are exported in the following format:\n",
    "\n",
    "        <dataset_dir>/\n",
    "            data/\n",
    "                <filename1>.<ext>\n",
    "                <filename2>.<ext>\n",
    "                ...\n",
    "            labels.csv\n",
    "\n",
    "    where ``labels.csv`` is a CSV file in the following format::\n",
    "\n",
    "        filepath,size_bytes,mime_type,width,height,num_channels,label\n",
    "        <filepath>,<size_bytes>,<mime_type>,<width>,<height>,<num_channels>,<label>\n",
    "        <filepath>,<size_bytes>,<mime_type>,<width>,<height>,<num_channels>,<label>\n",
    "        ...\n",
    "\n",
    "    Args:\n",
    "        export_dir: the directory to write the export\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, export_dir):\n",
    "        super().__init__(export_dir=export_dir)\n",
    "        self._data_dir = None\n",
    "        self._labels_path = None\n",
    "        self._labels = None\n",
    "        self._image_exporter = None\n",
    "        \n",
    "    @property\n",
    "    def requires_image_metadata(self):\n",
    "        \"\"\"Whether this exporter requires\n",
    "        :class:`fiftyone.core.metadata.ImageMetadata` instances for each sample\n",
    "        being exported.\n",
    "        \"\"\"\n",
    "        return True\n",
    "\n",
    "    @property\n",
    "    def label_cls(self):\n",
    "        \"\"\"The :class:`fiftyone.core.labels.Label` class(es) exported by this\n",
    "        exporter.\n",
    "\n",
    "        This can be any of the following:\n",
    "\n",
    "        -   a :class:`fiftyone.core.labels.Label` class. In this case, the\n",
    "            exporter directly exports labels of this type\n",
    "        -   a list or tuple of :class:`fiftyone.core.labels.Label` classes. In\n",
    "            this case, the exporter can export a single label field of any of\n",
    "            these types\n",
    "        -   a dict mapping keys to :class:`fiftyone.core.labels.Label` classes.\n",
    "            In this case, the exporter can handle label dictionaries with\n",
    "            value-types specified by this dictionary. Not all keys need be\n",
    "            present in the exported label dicts\n",
    "        -   ``None``. In this case, the exporter makes no guarantees about the\n",
    "            labels that it can export\n",
    "        \"\"\"\n",
    "        return fo.Classification\n",
    "\n",
    "    def setup(self):\n",
    "        \"\"\"Performs any necessary setup before exporting the first sample in\n",
    "        the dataset.\n",
    "\n",
    "        This method is called when the exporter's context manager interface is\n",
    "        entered, :func:`DatasetExporter.__enter__`.\n",
    "        \"\"\"\n",
    "        self._data_dir = os.path.join(self.export_dir, \"data\")\n",
    "        self._labels_path = os.path.join(self.export_dir, \"labels.csv\")\n",
    "        self._labels = []\n",
    "        \n",
    "        # The `ImageExporter` utility class provides an `export()` method\n",
    "        # that exports images to an output directory with automatic handling\n",
    "        # of things like name conflicts\n",
    "        self._image_exporter = foud.ImageExporter(\n",
    "            True, export_path=self._data_dir, default_ext=\".jpg\",\n",
    "        )\n",
    "        self._image_exporter.setup()\n",
    "        \n",
    "    def export_sample(self, image_or_path, label, metadata=None):\n",
    "        \"\"\"Exports the given sample to the dataset.\n",
    "\n",
    "        Args:\n",
    "            image_or_path: an image or the path to the image on disk\n",
    "            label: an instance of :meth:`label_cls`, or a dictionary mapping\n",
    "                field names to :class:`fiftyone.core.labels.Label` instances,\n",
    "                or ``None`` if the sample is unlabeled\n",
    "            metadata (None): a :class:`fiftyone.core.metadata.ImageMetadata`\n",
    "                instance for the sample. Only required when\n",
    "                :meth:`requires_image_metadata` is ``True``\n",
    "        \"\"\"\n",
    "        out_image_path, _ = self._image_exporter.export(image_or_path)\n",
    "\n",
    "        if metadata is None:\n",
    "            metadata = fo.ImageMetadata.build_for(image_or_path)\n",
    "\n",
    "        self._labels.append((\n",
    "            out_image_path,\n",
    "            metadata.size_bytes,\n",
    "            metadata.mime_type,\n",
    "            metadata.width,\n",
    "            metadata.height,\n",
    "            metadata.num_channels,\n",
    "            label.label,  # here, `label` is a `Classification` instance\n",
    "        ))\n",
    "\n",
    "    def close(self, *args):\n",
    "        \"\"\"Performs any necessary actions after the last sample has been\n",
    "        exported.\n",
    "\n",
    "        This method is called when the exporter's context manager interface is\n",
    "        exited, :func:`DatasetExporter.__exit__`.\n",
    "\n",
    "        Args:\n",
    "            *args: the arguments to :func:`DatasetExporter.__exit__`\n",
    "        \"\"\"\n",
    "        # Ensure the base output directory exists\n",
    "        basedir = os.path.dirname(self._labels_path)\n",
    "        if basedir and not os.path.isdir(basedir):\n",
    "            os.makedirs(basedir)\n",
    "\n",
    "        # Write the labels CSV file\n",
    "        with open(self._labels_path, \"w\") as f:\n",
    "            writer = csv.writer(f)\n",
    "            writer.writerow([\n",
    "                \"filepath\",\n",
    "                \"size_bytes\",\n",
    "                \"mime_type\",\n",
    "                \"width\",\n",
    "                \"height\",\n",
    "                \"num_channels\",\n",
    "                \"label\",\n",
    "            ])\n",
    "            for row in self._labels:\n",
    "                writer.writerow(row)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating a sample dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to use `CSVImageClassificationDatasetExporter`, we need some labeled image samples to work with.\n",
    "\n",
    "Let's use some samples from the test split of CIFAR-10:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split 'test' already downloaded\n",
      "Loading 'cifar10' split 'test'\n",
      " 100% |███| 10000/10000 [4.4s elapsed, 0s remaining, 2.2K samples/s]      \n"
     ]
    }
   ],
   "source": [
    "import fiftyone.zoo as foz\n",
    "\n",
    "num_samples = 1000\n",
    "\n",
    "#\n",
    "# Load `num_samples` from CIFAR-10\n",
    "#\n",
    "# This command will download the test split of CIFAR-10 from the web the first\n",
    "# time it is executed, if necessary\n",
    "#\n",
    "cifar10_test = foz.load_zoo_dataset(\"cifar10\", split=\"test\")\n",
    "samples = cifar10_test.limit(num_samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset:        cifar10-test\n",
      "Num samples:    1000\n",
      "Tags:           ['test']\n",
      "Sample fields:\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n",
      "Pipeline stages:\n",
      "    1. Limit(limit=1000)\n"
     ]
    }
   ],
   "source": [
    "# Print summary information about the samples\n",
    "print(samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'dataset_name': 'cifar10-test',\n",
      "    'id': '5f0e6d7f503bf2b87254061c',\n",
      "    'filepath': '~/fiftyone/cifar10/test/data/000001.jpg',\n",
      "    'tags': BaseList(['test']),\n",
      "    'metadata': None,\n",
      "    'ground_truth': <Classification: {'label': 'cat', 'confidence': None, 'logits': None}>,\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "# Print a sample\n",
    "print(samples.first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exporting a dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With our samples and `DatasetExporter` in-hand, exporting the samples to disk in our custom format is as simple as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Exporting 1000 samples to '/tmp/fiftyone/custom-dataset-exporter'\n",
      " 100% |█████| 1000/1000 [1.0s elapsed, 0s remaining, 1.0K samples/s]          \n"
     ]
    }
   ],
   "source": [
    "export_dir = \"/tmp/fiftyone/custom-dataset-exporter\"\n",
    "\n",
    "# Export the dataset\n",
    "print(\"Exporting %d samples to '%s'\" % (len(samples), export_dir))\n",
    "exporter = CSVImageClassificationDatasetExporter(export_dir)\n",
    "samples.export(dataset_exporter=exporter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's inspect the contents of the exported dataset to verify that it was written in the correct format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 168\r\n",
      "drwxr-xr-x     4 voxel51  wheel   128B Jul 14 22:46 \u001b[34m.\u001b[m\u001b[m\r\n",
      "drwxr-xr-x     3 voxel51  wheel    96B Jul 14 22:46 \u001b[34m..\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1002 voxel51  wheel    31K Jul 14 22:46 \u001b[34mdata\u001b[m\u001b[m\r\n",
      "-rw-r--r--     1 voxel51  wheel    83K Jul 14 22:46 labels.csv\r\n"
     ]
    }
   ],
   "source": [
    "!ls -lah /tmp/fiftyone/custom-dataset-exporter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 8000\r\n",
      "drwxr-xr-x  1002 voxel51  wheel    31K Jul 14 22:46 .\r\n",
      "drwxr-xr-x     4 voxel51  wheel   128B Jul 14 22:46 ..\r\n",
      "-rw-r--r--     1 voxel51  wheel   1.4K Jul 14 22:46 000001.jpg\r\n",
      "-rw-r--r--     1 voxel51  wheel   1.3K Jul 14 22:46 000002.jpg\r\n",
      "-rw-r--r--     1 voxel51  wheel   1.2K Jul 14 22:46 000003.jpg\r\n",
      "-rw-r--r--     1 voxel51  wheel   1.2K Jul 14 22:46 000004.jpg\r\n",
      "-rw-r--r--     1 voxel51  wheel   1.4K Jul 14 22:46 000005.jpg\r\n",
      "-rw-r--r--     1 voxel51  wheel   1.3K Jul 14 22:46 000006.jpg\r\n",
      "-rw-r--r--     1 voxel51  wheel   1.4K Jul 14 22:46 000007.jpg\r\n"
     ]
    }
   ],
   "source": [
    "!ls -lah /tmp/fiftyone/custom-dataset-exporter/data | head -n 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "filepath,size_bytes,mime_type,width,height,num_channels,label\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000001.jpg,1422,image/jpeg,32,32,3,cat\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000002.jpg,1285,image/jpeg,32,32,3,ship\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000003.jpg,1258,image/jpeg,32,32,3,ship\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000004.jpg,1244,image/jpeg,32,32,3,airplane\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000005.jpg,1388,image/jpeg,32,32,3,frog\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000006.jpg,1311,image/jpeg,32,32,3,frog\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000007.jpg,1412,image/jpeg,32,32,3,automobile\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000008.jpg,1218,image/jpeg,32,32,3,frog\r\n",
      "/tmp/fiftyone/custom-dataset-exporter/data/000009.jpg,1262,image/jpeg,32,32,3,cat\r\n"
     ]
    }
   ],
   "source": [
    "!head -n 10 /tmp/fiftyone/custom-dataset-exporter/labels.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleanup\n",
    "\n",
    "You can cleanup the files generated by this recipe by running:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -rf /tmp/fiftyone"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
